Exploratory data analysis (EDA) is an approach to data analysis that summarises and visualises the important characteristics of a data set. It is not a formal process with a strict set of rules, but it is an important part of any data analysis because you always need to investigate the quality of your data.
Your goal during EDA is to develop an understanding of your data.
Bivariate analysis includes:
Correlation
Graphical techniques include:
Box plot
The diamonds data set contains the prices of diamonds and their characteristics. The variables used are as follows:
| Variable | Description |
|---|---|
| price | price in US dollars ($326-$18,823) |
| carat | weight of the diamond (0.2-5.01) |
| cut | quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
| color | diamond colour, from J (worst) to D (best) |
| clarity | a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) |
| x | length in mm (0-10.74) |
| y | width in mm (0-58.9) |
| z | depth in mm (0-31.8) |
| depth | total depth percentage = z / mean(x, y) = 2 * z / (x + y), (43-79) |
| table | width of top of diamond relative to widest point (43-95) |
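library(ggplot2)     # provides the diamonds data set
library(DT)          # provides datatable()
data("diamonds")
datatable(diamonds)  # interactive table of the raw data
summary(diamonds)    # summary statistics for each variable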
##      carat               cut        color        clarity
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066
##  Max.   :5.0100                     I: 5422   VVS1   : 3655
##                                     J: 2808   (Other): 2531
##      depth           table           price             x
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740
##
##        y                z
##  Min.   : 0.000   Min.   : 0.000
##  1st Qu.: 4.720   1st Qu.: 2.910
##  Median : 5.710   Median : 3.530
##  Mean   : 5.735   Mean   : 3.539
##  3rd Qu.: 6.540   3rd Qu.: 4.040
##  Max.   :58.900   Max.   :31.800
##
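The summary reports a minimum of 0 for x, y and z, which is not physically possible for a diamond and is the kind of data-quality issue EDA is meant to surface. The sketch below counts those rows and recomputes depth from the formula in the variable table. It assumes the depth column is stored as a percentage (hence the factor of 100); the names `zero_dims`, `valid` and `depth_check` are illustrative only.

```r
library(ggplot2)  # provides the diamonds data set

data("diamonds")

# Rows where at least one physical dimension is recorded as zero
zero_dims <- subset(diamonds, x == 0 | y == 0 | z == 0)
nrow(zero_dims)

# Recompute depth as 100 * 2 * z / (x + y) and compare it with the recorded
# value. Rows with x + y == 0 are excluded to avoid division by zero.
# Assumption: depth is stored as a percentage, so the ratio is scaled by 100.
valid <- subset(diamonds, x + y > 0)
depth_check <- 100 * 2 * valid$z / (valid$x + valid$y)
summary(abs(depth_check - valid$depth))
```

Large differences in the last summary would point to rows whose dimensions deserve a closer look.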
The following plot checks for missing values.
library(naniar)
gg_miss_var(diamonds, show_pct = TRUE)
There are no missing values in the diamonds data.
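The same conclusion can be checked numerically. This is a minimal sketch using base R only; `na_counts` and `total_missing` are illustrative names. Note that it only counts values coded as `NA`, so implausible zeros are not flagged.

```r
library(ggplot2)  # provides the diamonds data set

data("diamonds")

# Number of NA values in each column
na_counts <- colSums(is.na(diamonds))
na_counts

# Total number of NA values in the whole data set
total_missing <- sum(na_counts)
total_missing
```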
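The following shows the correlations between the numeric variables in the diamonds data.
library(corrplot)  # provides corrplot()
num <- diamonds[c('price','carat','x','y','z','depth','table')]  # numeric variables only
corrplot(cor(num), type = "full", method = "square")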
Several variables exhibit multicollinearity, namely carat, x, y and z. The price variable has a fairly high positive correlation with carat, x, y and z.
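To read the strongly correlated pairs off as numbers rather than from the plot, something like the following could be used. It is a sketch, not part of the original analysis: the cut-off of 0.8 is an arbitrary assumption, and `cor_mat`, `threshold` and `high` are illustrative names.

```r
library(ggplot2)  # provides the diamonds data set

data("diamonds")
num <- diamonds[c("price", "carat", "x", "y", "z", "depth", "table")]

# Pearson correlation matrix of the numeric variables
cor_mat <- cor(num)
round(cor_mat, 2)

# List each pair once (upper triangle) whose absolute correlation exceeds
# the threshold. The value 0.8 is an assumed cut-off, not a rule.
threshold <- 0.8
high <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = round(cor_mat[high], 2)
)
```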
The following is a histogram of the response variable, price.
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price, fill = after_stat(count)), binwidth = 900) +
  scale_y_continuous(name = "Frequency") +
  scale_x_continuous(name = "Price ($US)") +
  ggtitle("Frequency histogram of diamond price ($US)")
The histogram is right-skewed, so the price variable is not normally distributed. Prices are concentrated at the low end of the range: the median is $2,401, and most diamonds cost less than $5,000. A transformation can be applied to the price variable.
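One common choice for right-skewed, strictly positive data is a logarithmic transformation. The sketch below assumes a base-10 log, which the original does not specify; `log_price` and `p_log` are illustrative names, and the number of bins is arbitrary.

```r
library(ggplot2)  # provides the diamonds data set and plotting functions

data("diamonds")

# Compare mean and median before and after the transformation:
# a mean well above the median is a sign of right skew.
log_price <- log10(diamonds$price)
c(mean = mean(diamonds$price), median = median(diamonds$price))
c(mean = mean(log_price), median = median(log_price))

# Histogram of price on a log10 x-axis (binning happens on the log scale)
p_log <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 30) +
  scale_x_log10(name = "Price ($US, log10 scale)") +
  scale_y_continuous(name = "Frequency") +
  ggtitle("Histogram of diamond price on a log10 scale")
p_log
```

Whether the transformed variable is close enough to normal for a given model still needs to be judged from the resulting plot.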
The following is a histogram of the carat variable.
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat, fill = after_stat(count)), binwidth = 0.4) +
  scale_y_continuous(name = "Frequency") +
  scale_x_continuous(name = "Carat of the diamond") +
  ggtitle("Frequency histogram of diamond carat")
The histogram is right-skewed, so the carat variable is not normally distributed. Carat ranges from 0.2 to 5.01, but the bars for values above 3 are not visible in the histogram because their frequencies are too small. The y-axis can be zoomed in to reveal them, as in the following histogram.
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat, fill = after_stat(count)), binwidth = 0.4) +
  scale_y_continuous(name = "Frequency") +
  scale_x_continuous(name = "Weight of the diamond (carat)") +
  ggtitle("Frequency histogram of diamond weight") +
  coord_cartesian(ylim = c(0, 100))  # zoom in on small counts without dropping data
Alternatively, the counts can be tabulated as follows.
library(dplyr)  # provides %>% and count()
diamonds %>% count(cut_width(carat, 0.4))
## # A tibble: 12 x 2
##    `cut_width(carat, 0.4)`     n
##    <fct>                   <int>
##  1 [0.2,0.6]               24448
##  2 (0.6,1]                 11990
##  3 (1,1.4]                 11093
##  4 (1.4,1.8]                4135
##  5 (1.8,2.2]                1812
##  6 (2.2,2.6]                 395
##  7 (2.6,3]                    35
##  8 (3,3.4]                    22
##  9 (3.4,3.8]                   4
## 10 (3.8,4.2]                   4
## 11 (4.2,4.6]                   1
## 12 (5,5.4]                     1
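The table shows how rare the heaviest diamonds are. A short base R sketch makes that explicit as a count and a percentage; the 3-carat cut-off follows the discussion of the histogram, and `n_total` and `n_large` are illustrative names.

```r
library(ggplot2)  # provides the diamonds data set

data("diamonds")

n_total <- nrow(diamonds)            # number of diamonds in the data set
n_large <- sum(diamonds$carat > 3)   # diamonds heavier than 3 carats

# Count and share (in percent) of diamonds above 3 carats
c(count = n_large, percent = round(100 * n_large / n_total, 3))
```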
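The following are box plots for each numeric variable other than price.
library(reshape2)  # provides melt()
# Drop price (first column), reshape to long format, and draw one panel per variable
ggplot(data = melt(num[,-1]), aes(x = variable, y = value)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = 'free')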
Box plots of price can also be drawn for the categorical variables, for example by cut.
kat <- diamonds[c('cut','color','clarity','price')]  # categorical variables plus price
ggplot(data = kat, aes(x = cut, y = price)) +
  geom_boxplot()
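The same plot can be repeated for the other two categorical variables, color and clarity. The sketch below builds one box plot per variable in a loop; it relies on the `.data` pronoun available in recent ggplot2 versions, and `cat_vars` and `plots` are illustrative names.

```r
library(ggplot2)  # provides the diamonds data set and plotting functions

data("diamonds")

# Categorical variables to plot against price
cat_vars <- c("cut", "color", "clarity")

# Build one box plot of price per categorical variable.
# .data[[v]] looks the column up by name, so factor level order is kept.
plots <- lapply(cat_vars, function(v) {
  ggplot(diamonds, aes(x = .data[[v]], y = price)) +
    geom_boxplot() +
    labs(x = v, y = "Price ($US)")
})
names(plots) <- cat_vars

plots$color    # price by colour grade
plots$clarity  # price by clarity grade
```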